
Scientific Text


SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts

Brinner, Marc, Zarriess, Sina

arXiv.org Artificial Intelligence

We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. The resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model's ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, highlighting the benefits of a semantically focused training approach.
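The abstract describes a contrastive objective over embeddings of LLM-generated summaries but gives no implementation details. As a minimal sketch only, the standard InfoNCE-style loss such approaches typically build on can be written with plain NumPy; the arrays here stand in for model outputs, and the temperature value is an invented placeholder, not SemCSE's actual setting:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style contrastive loss: for each anchor embedding, the matching
    row in `positives` (e.g. the embedding of another summary of the same
    abstract) is the positive; all other rows act as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # matching pairs on the diagonal
```

Minimizing this loss pulls each anchor toward its paired summary and pushes it away from the other summaries in the batch, which is the "semantic separation" effect the abstract refers to.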


THM@SimpleText 2025 -- Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts

Hofmann, Nico, Dauenhauer, Julian, Dietzler, Nils Ole, Idahor, Idehen Daniel, Kreutz, Christin Katharina

arXiv.org Artificial Intelligence

Scientific text is complex, as it contains technical terms by definition. Simplifying such text for non-domain experts enhances the accessibility of innovation and information. Politicians could be enabled to understand new findings on topics on which they intend to pass a law, or family members of seriously ill patients could read about clinical trials. The SimpleText CLEF Lab focuses on exactly this problem of simplifying scientific text. Task 1.1 of the 2025 edition specifically handles the simplification of complex sentences, i.e., very short texts with little context. To tackle this task, we investigate identifying complex terms in sentences, which are then rephrased for non-expert readers using small Gemini and OpenAI large language models.
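The abstract does not specify how complex terms are identified before rephrasing. As a hedged illustration of the general pattern (detect candidate terms, then build an LLM rephrasing prompt around them), the sketch below uses an invented length-and-frequency heuristic; the word list, threshold, and prompt wording are all placeholders, not the paper's method:

```python
import re

# Tiny stand-in for a frequency list of everyday words (illustrative only).
COMMON_WORDS = {
    "the", "a", "an", "of", "in", "and", "to", "for", "is", "are",
    "with", "on", "that", "this", "by", "from", "as", "it", "be",
}

def find_complex_terms(sentence, common_words=COMMON_WORDS, min_length=9):
    """Flag long tokens absent from the common-word list as complex."""
    tokens = re.findall(r"[A-Za-z-]+", sentence)
    return [t for t in tokens if t.lower() not in common_words and len(t) >= min_length]

def build_simplification_prompt(sentence, terms):
    """Assemble a rephrasing prompt around the flagged terms."""
    term_list = ", ".join(terms) if terms else "none found"
    return (
        "Rewrite the following sentence for a non-expert reader, "
        f"explaining or replacing these complex terms ({term_list}):\n{sentence}"
    )
```

The resulting prompt string would then be sent to a small Gemini or OpenAI model; the API call itself is omitted here.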


Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences

Seidlmayer, Eva, Galke, Lukas, Förstner, Konrad U.

arXiv.org Artificial Intelligence

Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact-checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences
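The abstract names linguistic and rhetorical features as classification proxies without listing them. The sketch below invents a few toy surface features (punctuation rates, sentence length, a made-up emotive word list) purely to illustrate the kind of signal such classifiers can use; it is not the paper's feature set:

```python
import re

# Invented mini-lexicon of emotionally charged words, for illustration only.
EMOTIVE = {"shocking", "miracle", "terrifying", "amazing", "deadly", "secret"}

def rhetorical_features(text):
    """Compute simple surface features that often differ between sober
    scientific prose and attention-seeking disinformation."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    return {
        "exclamations_per_sentence": text.count("!") / n_sent,
        "questions_per_sentence": text.count("?") / n_sent,
        "avg_sentence_length": len(words) / n_sent,
        "emotive_word_ratio": sum(w in EMOTIVE for w in words) / n_words,
    }
```

Feature dictionaries like these could feed a classical classifier (e.g. logistic regression) over the four FSoLS categories.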


Comparative Analysis of OpenAI GPT-4o and DeepSeek R1 for Scientific Text Categorization Using Prompt Engineering

Maiti, Aniruddha, Adewumi, Samuel, Tikure, Temesgen Alemayehu, Wang, Zichun, Sengupta, Niladri, Sukhanova, Anastasiia, Jana, Ananya

arXiv.org Artificial Intelligence

This study examines how large language models categorize sentences from scientific papers using prompt engineering. We use two advanced web-based models, GPT-4o (by OpenAI) and DeepSeek R1, to classify sentences into predefined relationship categories. DeepSeek R1 has been tested on benchmark datasets in its technical report. However, its performance in scientific text categorization remains unexplored. To address this gap, we introduce a new evaluation method designed specifically for this task. We also compile a dataset of cleaned scientific papers from diverse domains. This dataset provides a platform for comparing the two models. Using this dataset, we analyze their effectiveness and consistency in categorization.
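The abstract describes prompt-engineered sentence classification but does not reproduce the prompt or the label set. As a sketch under stated assumptions, the category names, prompt wording, and lenient output parsing below are all invented for illustration:

```python
# Hypothetical relationship categories; the paper's actual label set is not
# reproduced here.
CATEGORIES = ["background", "method", "result", "comparison"]

def classification_prompt(sentence, categories=CATEGORIES):
    """Build a constrained single-label classification prompt for an LLM."""
    options = ", ".join(categories)
    return (
        f"Classify the sentence into exactly one of: {options}.\n"
        f"Sentence: {sentence}\nAnswer with the category name only."
    )

def parse_label(model_output, categories=CATEGORIES):
    """Map a free-form model reply back onto the label set; None if no match."""
    reply = model_output.strip().lower()
    for category in categories:
        if category in reply:
            return category
    return None
```

Tolerant parsing of this kind matters when comparing two web-based models, since each may format its answer differently.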


On the Effectiveness of Large Language Models in Automating Categorization of Scientific Texts

Shahi, Gautam Kishore, Hummel, Oliver

arXiv.org Artificial Intelligence

The amount of scholarly texts is consistently increasing; around 2.5 million research articles are published yearly (Rabby et al., 2024). Due to this enormous increase, the classification of (scientific) texts has been attracting even more attention in recent years (Bornmann et al., 2021). Classifying the research area of scientific texts requires significant domain knowledge in various complex research fields. Hence, manual classification is challenging and time-consuming for librarians and limits the number of texts that can be classified manually (Zhang et al., 2023). Moreover, due to complex hierarchical classification schemes and their existing variety, classification of publications is also an unpopular activity for researchers. Prominent examples of classification schemes include the Open Research Knowledge Graph (ORKG) (Auer and Mann, 2019), Microsoft Academic Graph (Wang et al., 2020), the Semantic Scholar Academic Graph (Kinney et al., 2023), the ACM Computing Classification System (Rous, 2012), Dewey Decimal Classification (DDC) (Scott, 1998), and the ACL Anthology (Bird et al., 2008).


ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity

Xie, Tong, Zhang, Hanzhi, Wang, Shaozhou, Wan, Yuwei, Razzak, Imran, Kit, Chunyu, Zhang, Wenjie, Hoex, Bram

arXiv.org Artificial Intelligence

Natural Language Processing (NLP) is widely used to distill long, unstructured text into structured information. However, extracting structured knowledge from scientific text with NLP models remains challenging because of the domain-specific nature of such text, complex data preprocessing requirements, and the granularity of multi-layered device-level information. To address this, we introduce ByteScience, a non-profit cloud-based auto fine-tuned Large Language Model (LLM) platform, which is designed to extract structured scientific data and synthesize new scientific knowledge from vast scientific corpora. The platform capitalizes on DARWIN, an open-source, fine-tuned LLM dedicated to natural science. The platform was built on Amazon Web Services (AWS) and provides an automated, user-friendly workflow for custom model development and data extraction. The platform achieves remarkable accuracy with only a small number of well-annotated articles. This innovative tool streamlines the transition from scientific literature to structured knowledge and data, and benefits advances in natural informatics.


Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study

Rostam, Zhyar Rzgar K, Kertész, Gábor

arXiv.org Artificial Intelligence

The exponential growth of online textual content across diverse domains has necessitated advanced methods for automated text classification. Large Language Models (LLMs) based on transformer architectures have shown significant success in this area, particularly in natural language processing (NLP) tasks. However, general-purpose LLMs often struggle with domain-specific content, such as scientific texts, due to unique challenges like specialized vocabulary and imbalanced data. In this study, we fine-tune four state-of-the-art LLMs (BERT, SciBERT, BioBERT, and BlueBERT) on three datasets derived from the WoS-46985 dataset to evaluate their performance in scientific text classification. Our experiments reveal that domain-specific models, particularly SciBERT, consistently outperform general-purpose models in both abstract-based and keyword-based classification tasks. Additionally, we compare our achieved results with those reported in the literature for deep learning models, further highlighting the advantages of LLMs, especially when utilized in specific domains. The findings emphasize the importance of domain-specific adaptations for LLMs to enhance their effectiveness in specialized text classification tasks.
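Fine-tuning BERT-style models for classification amounts to training a softmax head on top of the encoder's pooled output (and usually updating the encoder too). As a rough NumPy sketch of just the head's training step, with random feature vectors standing in for encoder outputs, none of which reflects the paper's actual setup:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_classifier_head(features, labels, n_classes, lr=0.5, steps=300, seed=0):
    """Train a linear softmax head with plain gradient descent; in real
    fine-tuning this head sits on top of the transformer's pooled output."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(features.shape[1], n_classes))
    b = np.zeros(n_classes)
    n = len(labels)
    for _ in range(steps):
        probs = softmax(features @ W + b)
        grad = probs.copy()
        grad[np.arange(n), labels] -= 1.0      # dL/dlogits for cross-entropy
        grad /= n
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)
```

Domain-specific encoders like SciBERT matter precisely because the quality of the `features` this head receives determines how separable the classes are.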


Steering AI-Driven Personalization of Scientific Text for General Audiences

Kim, Taewook, Agarwal, Dhruv, Ackerman, Jordan, Saha, Manaswi

arXiv.org Artificial Intelligence

Digital media platforms (e.g., social media, science blogs) offer opportunities to communicate scientific content to general audiences at scale. However, these audiences vary in their scientific expertise, literacy levels, and personal backgrounds, making effective science communication challenging. To address this challenge, we designed TranSlider, an AI-powered tool that generates personalized translations of scientific text based on individual user profiles (e.g., hobbies, location, and education). Our tool features an interactive slider that allows users to steer the degree of personalization from 0 (weakly relatable) to 100 (strongly relatable), leveraging LLMs to generate translations at the chosen degree. Through an exploratory study with 15 participants, we investigated both the utility of these AI-personalized translations and how interactive reading features influenced users' understanding and reading experiences. We found that participants who preferred higher degrees of personalization appreciated the relatable and contextual translations, while those who preferred lower degrees valued concise translations with subtle contextualization. Furthermore, participants reported the compounding effect of multiple translations on their understanding of scientific content. Given these findings, we discuss several implications of AI-personalized translation tools in facilitating communication in collaborative contexts.
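One way a 0-100 slider can steer an LLM is by mapping the degree onto prompt instructions of increasing personalization strength. The band boundaries and wording below are invented for this sketch and are not TranSlider's actual prompting scheme:

```python
def personalization_prompt(text, profile, degree):
    """Map a 0-100 personalization slider onto prompt instructions;
    the three bands and their phrasing are illustrative placeholders."""
    if not 0 <= degree <= 100:
        raise ValueError("degree must be between 0 and 100")
    if degree < 34:
        style = "keep the explanation concise, with only subtle nods to the reader's background"
    elif degree < 67:
        style = "weave in occasional analogies drawn from the reader's background"
    else:
        style = "ground the whole explanation in analogies from the reader's background"
    return (
        f"Reader profile: {profile}\n"
        f"Rewrite the scientific text below for this reader; {style}.\n"
        f"Text: {text}"
    )
```

A continuous alternative would interpolate instruction strength rather than using discrete bands; the study's finding that some readers prefer "subtle contextualization" maps to the low end of either design.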


KALE-LM: Unleash The Power Of AI For Science Via Knowledge And Logic Enhanced Large Model

Dai, Weichen, Chen, Yezeng, Dai, Zijie, Huang, Zhijie, Liu, Yubo, Pan, Yixuan, Song, Baiyang, Zhong, Chengli, Li, Xinhe, Wang, Zeyu, Feng, Zhuoying, Zhou, Yi

arXiv.org Artificial Intelligence

In recent years, the rapid development of artificial intelligence (AI) technology has enabled it to achieve, and in some cases surpass, top human performance in various high-intelligence tasks. These include speech [1], face [2], and image [3] recognition; games such as Go [4], StarCraft [5], and Dota2 [6]; as well as tasks related to text [7], image [8], and video generation, machine translation [9], knowledge-based question answering [10], debates, and solving advanced mathematical problems [11]. Science is one of the most important fields for the application of AI. As the crown jewel of human civilization and the cornerstone of various industries, science is a core driver of human progress, and its development can significantly accelerate and even revolutionize many fields. Historically, there have been three major research paradigms in science: the first paradigm, experiment, which emerged from Newtonian empiricism; the second paradigm, theory, born from Einstein's rationalism; and the third paradigm, simulation/computation, which arose from the third industrial revolution, the computation and information revolution.


Towards understanding evolution of science through language model series

Dong, Junjie, Lyu, Zhuoqi, Ke, Qing

arXiv.org Artificial Intelligence

We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenization and "one model to rule them all", AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full text of 1.7 million arXiv papers published until 2008 and a collection of models progressively trained on arXiv papers on an annual basis. We demonstrate the effectiveness of AnnualBERT models by showing that they not only achieve comparable performance on standard tasks but also state-of-the-art performance on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then use probing tasks to quantify the models' behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models not only to improve performance on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of models is available at https://huggingface.co/jd445/AnnualBERTs.
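The distinction between whole-word and subword tokenization is easy to show in miniature. In a whole-word scheme like the one the abstract describes, an out-of-vocabulary word maps to a single unknown token instead of being split into subword pieces; the vocabulary and regex below are invented for the sketch:

```python
import re

def whole_word_tokenize(text, vocab, unk="[UNK]"):
    """Whole-word tokenization: keep each word as a single token and map
    out-of-vocabulary words to [UNK] instead of splitting them into subwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t if t in vocab else unk for t in tokens]
```

Keeping words whole makes per-token probes about scientific terminology straightforward (one word, one representation), at the cost of a larger vocabulary and more unknowns than a subword scheme would produce.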